5.1 Introduction to Asymptotics

1 Introduction

In the above example, we want to test $H_0: \beta_1 = 0$, so we need to condition on $X_{-1}^T y$ (as discussed in testing with nuisance parameters). But that would essentially condition on $y$ itself.
If we want to estimate $\beta$, a UMVU estimator generically does not exist, and a Bayes estimator requires a prior on $\beta \in \mathbb{R}^d$.
Software packages use general-purpose asymptotic methods:
$$\hat\beta_{MLE}(x, y) = \arg\max_{\beta \in \mathbb{R}^d} p_\beta(y \mid x) = \arg\max_{\beta \in \mathbb{R}^d} \Big\{ \beta^T X^T y - \sum_{i=1}^n A(\beta; x_i) \Big\}.$$
Asymptotically, $\hat\beta_{MLE} \approx N(\beta, J(\beta)^{-1})$, where $J(\beta)$ is the Fisher information.

The mean $\beta$ appears because $\hat\beta_{MLE}$ is asymptotically unbiased, and the covariance comes from the curvature of the log-likelihood: $-\nabla^2 \ell(\hat\beta; X, y) \approx E_\beta[-\nabla^2 \ell(\beta; X, y)] = J(\beta)$, so $\hat\Sigma = (-\nabla^2 \ell(\hat\beta))^{-1} \approx \Sigma(\beta) = J(\beta)^{-1}$.

So $Z_j = \frac{\hat\beta_j - \beta_j}{\hat\sigma_j} \approx N(0, 1)$. To test $H_0: \beta_j = 0$, reject if $\hat\beta_j / \hat\sigma_j$ is too large/small/extreme.
We can invert the test [1]: $|Z_j| < z_{\alpha/2} \iff \beta_j \in \hat\beta_j \pm z_{\alpha/2} \hat\sigma_j$, giving a confidence interval.
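The Wald test and interval above can be sketched numerically. The following is a minimal illustration, not from the notes: the data, seed, and sample sizes are arbitrary choices, and the logistic MLE is computed by plain Newton-Raphson.

```python
# Sketch: Wald z-statistics and confidence intervals for logistic-regression
# coefficients, using the approximation beta_hat ~ N(beta, J(beta)^{-1}).
# Simulated data and all constants here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
beta_true = np.array([0.5, -1.0, 0.0])       # third coordinate: H0 holds
p = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, p)

# Newton-Raphson for the logistic MLE (log-likelihood is concave)
beta = np.zeros(d)
for _ in range(50):
    mu = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - mu)                    # score vector
    H = X.T @ (X * (mu * (1 - mu))[:, None]) # observed information
    beta += np.linalg.solve(H, grad)

Sigma_hat = np.linalg.inv(H)                 # estimate of J(beta)^{-1}
se = np.sqrt(np.diag(Sigma_hat))
z = beta / se                                # Wald statistics for H0: beta_j = 0
z_crit = 1.96                                # z_{alpha/2} for alpha = 0.05
ci = np.column_stack([beta - z_crit * se, beta + z_crit * se])
print(z)
print(ci)
```

With this simulated data, the first two coordinates give large $|Z_j|$ (reject) while the third is typically near 0.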


So far, everything has been finite-sample, often using special properties of the model $\mathcal{P}$ (like exponential family structure) to do exact calculations.
For "general" models, exact calculations may be intractable or impossible. But we may be able to approximate our problem with a simpler problem in which calculations are easy.
Typically we approximate by a Gaussian, taking a limit in the number of observations. But this is only interesting if the approximation is good at "reasonable" sample sizes.

2 Probability Recall

2.1 Convergence

Let $X_1, X_2, \ldots \in \mathbb{R}^d$ be a sequence of random vectors. We care about two kinds of convergence: convergence in distribution (weak convergence), written $X_n \Rightarrow X$, and convergence in probability, written $X_n \xrightarrow{p} X$.

Theorem (Weak Convergence)

For $X_1, X_2, \ldots \in \mathbb{R}$, let $F_n(x) = P(X_n \le x)$ and $F(x) = P(X \le x)$. Then $X_n \Rightarrow X$ iff $F_n(x) \to F(x)$ for all $x$ at which $F$ is continuous.

Example: if $X_n \sim \delta_{1/n}$ and $X \sim \delta_0$, then $X_n \Rightarrow X$, even though $F_n(0) = 0$ does not converge to $F(0) = 1$ (the point $x = 0$ is a discontinuity of $F$).
Proposition

$X_n \xrightarrow{p} c \iff X_n \Rightarrow \delta_c$.

In a sequence of statistical models $\mathcal{P}_n = \{P_{n,\theta} : \theta \in \Theta\}$ with $X_n \sim P_{n,\theta}$, we say $\delta_n(X_n)$ is consistent for $g(\theta)$ if $\delta_n(X_n) \xrightarrow{P_\theta} g(\theta)$, meaning $P_\theta(\|\delta_n(X_n) - g(\theta)\| > \varepsilon) \to 0$ for every $\varepsilon > 0$. Usually we omit the index $n$, because the sequence is implicit.

2.2 Limit Theorems

Denote $\bar X_n = \frac{1}{n} \sum_{i=1}^n X_i$. We are familiar with the law of large numbers, $\bar X_n \xrightarrow{p} \mu$, and the central limit theorem, $\sqrt{n}(\bar X_n - \mu) \Rightarrow N(0, \sigma^2)$.
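Both limit theorems are easy to see by simulation. A sketch with illustrative Exp(1) data ($\mu = 1$, $\sigma^2 = 1$); the sample sizes and seed are arbitrary:

```python
# Sketch: LLN and CLT for the sample mean of Exp(1) data.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 400, 5000
X = rng.exponential(size=(reps, n))     # mu = 1, sigma^2 = 1
xbar = X.mean(axis=1)
print(xbar.mean())                      # LLN: close to mu = 1
z = np.sqrt(n) * (xbar - 1.0)           # CLT: approximately N(0, 1)
print(z.std(), np.mean(z <= 1.6449))    # sd near 1; P(Z <= z_0.95) near 0.95
```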

3 Continuous Mapping, Delta Method

Theorem (Continuous Mapping)

Let $g$ be a continuous function and $X_1, X_2, \ldots$ a sequence of random variables.

  • If $X_n \Rightarrow X$, then $g(X_n) \Rightarrow g(X)$.
  • If $X_n \xrightarrow{p} c$, then $g(X_n) \xrightarrow{p} g(c)$.
Theorem (Slutsky)

Assume $X_n \Rightarrow X$ and $Y_n \xrightarrow{p} c$. Then $X_n + Y_n \Rightarrow X + c$, $X_n Y_n \Rightarrow cX$, and $X_n / Y_n \Rightarrow X / c$ if $c \neq 0$.
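A standard application of Slutsky is the t-statistic: $\sqrt{n}(\bar X_n - \mu) \Rightarrow N(0, \sigma^2)$ and the sample standard deviation $S_n \xrightarrow{p} \sigma$, so the ratio converges to $N(0, 1)$. A sketch with an illustrative Uniform(0, 2) distribution:

```python
# Sketch: Slutsky's theorem behind the t-statistic.
# Numerator => N(0, sigma^2), denominator S_n ->p sigma, ratio => N(0, 1).
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 5000
X = rng.uniform(0, 2, size=(reps, n))   # mu = 1, sigma^2 = 1/3
t = np.sqrt(n) * (X.mean(axis=1) - 1.0) / X.std(axis=1, ddof=1)
print(t.mean(), t.std())                # approximately 0 and 1
```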

Theorem (Delta Method)

If $\sqrt{n}(X_n - \mu) \Rightarrow N(0, \sigma^2)$ and $f$ is differentiable at $x = \mu$, then $\sqrt{n}(f(X_n) - f(\mu)) \Rightarrow N(0, \dot f(\mu)^2 \sigma^2)$.

Informal statement:
If $X_n \approx N(\mu, \frac{\sigma^2}{n})$, then $f(X_n) \approx N(f(\mu), \dot f(\mu)^2 \frac{\sigma^2}{n})$.
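The first-order delta method can be checked by simulation. A sketch with the illustrative choice $f(x) = x^2$ and Exp(1) data, so $\mu = 1$, $\sigma^2 = 1$, $\dot f(1) = 2$, and the limiting standard deviation of $\sqrt{n}(f(\bar X_n) - f(\mu))$ is 2:

```python
# Sketch: first-order delta method with f(x) = x^2 (illustrative choice).
# sqrt(n)(f(X_bar) - f(mu)) => N(0, f'(mu)^2 sigma^2) = N(0, 4) here.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 1000, 5000
xbar = rng.exponential(size=(reps, n)).mean(axis=1)   # mu = 1, sigma^2 = 1
w = np.sqrt(n) * (xbar**2 - 1.0)
print(w.std())                                        # close to 2
```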

In general, we can use higher-order Taylor expansions in the delta method when lower-order derivatives are 0.

$$f(X_n) \approx f(\mu) + \dot f(\mu)(X_n - \mu) + \frac{\ddot f(\mu)}{2}(X_n - \mu)^2 + \cdots$$

If $\dot f(\mu) = 0$, use the second-order term: $n(f(X_n) - f(\mu)) \approx \frac{\ddot f(\mu)}{2} \left(\sqrt{n}(X_n - \mu)\right)^2 \Rightarrow \frac{\ddot f(\mu) \sigma^2}{2} \chi^2_1$.
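The second-order case can also be simulated. A sketch with the illustrative choice $f(x) = x^2$ at $\mu = 0$ (so $\dot f(0) = 0$, $\ddot f(0) = 2$) and $N(0, 1)$ data, where the limit $\frac{\ddot f(0) \sigma^2}{2} \chi^2_1 = \chi^2_1$ has mean 1:

```python
# Sketch: second-order delta method with f(x) = x^2 at mu = 0.
# n(f(X_bar) - f(0)) => (f''(0)/2) sigma^2 chi^2_1 = chi^2_1 here.
import numpy as np

rng = np.random.default_rng(5)
n, reps = 500, 5000
xbar = rng.normal(0, 1, size=(reps, n)).mean(axis=1)  # mu = 0, sigma^2 = 1
w = n * xbar**2                                       # approximately chi^2_1
print(w.mean())                                       # chi^2_1 has mean 1
```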


  1. Recall this section. ↩︎